Add delete file index to pyiceberg and support equality delete reads #2255
base: main
Conversation
I noticed that this PR addresses the same issue/feature as the one I was working on here. However, your implementation is more complete (it supports reading equality deletes and deletion vectors), so I think it makes sense to move forward with this one instead. (cc: @sungwy, since you reviewed my PR)
oops, sorry @gabeiglio, I was searching GitHub for positional deletes and didn't see that you were already working on it in that PR. Looks like some parts of that PR are still super useful to get merged, like the validations.
Yea exactly, I should have been clearer in my message: my implementation of DeleteFileIndex was scope creep to achieve the validation. So now that PR can cover only the validation instead of partition maps, the delete file index, etc. :) @kevinjqliu
pyiceberg/io/pyarrow.py
Outdated
@@ -978,18 +979,23 @@ def _get_file_format(file_format: FileFormat, **kwargs: Dict[str, Any]) -> ds.FileFormat:
     raise ValueError(f"Unsupported file format: {file_format}")


-def _read_deletes(io: FileIO, data_file: DataFile) -> Dict[str, pa.ChunkedArray]:
+def _read_deletes(io: FileIO, data_file: DataFile) -> Union[Dict[str, pa.ChunkedArray], pa.Table]:
I think the output signature and the role of this function are convoluted.
Would it make sense to have two separate functions instead?
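A minimal sketch of what that split could look like. The names _read_positional_deletes and _read_equality_deletes are hypothetical, not part of this PR; the point is that each reader gets a single, unambiguous return type:

from typing import Dict

import pyarrow as pa

from pyiceberg.io import FileIO
from pyiceberg.manifest import DataFile


def _read_positional_deletes(io: FileIO, data_file: DataFile) -> Dict[str, pa.ChunkedArray]:
    # Map each referenced data file path to the row positions to drop.
    ...


def _read_equality_deletes(io: FileIO, data_file: DataFile) -> pa.Table:
    # Return the delete rows as a table keyed by the file's equality IDs.
    ...

With one return type per function, callers no longer need to branch on the shape of the result.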
pyiceberg/io/pyarrow.py
Outdated
equality_delete_tasks = []
for task in tasks:
    equality_deletes = [df for df in task.delete_files if df.content == DataFileContent.EQUALITY_DELETES]
    if equality_deletes:
        for delete_file in equality_deletes:
            # pair each data file with its associated equality delete
            equality_delete_tasks.append((task.file.file_path, delete_file))

if equality_delete_tasks:
    executor = ExecutorFactory.get_or_create()
    # process equality delete tasks in parallel, like position deletes
    equality_delete_results = executor.map(
        lambda args: (args[0], _read_deletes(io, args[1])),
        equality_delete_tasks,
    )
We are already getting the subset of files that have equality deletes, so it would make sense to use a different function to read the deletes rather than the convoluted _read_deletes.
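Under that suggestion, the parallel step quoted above could call a dedicated reader directly. A sketch, reusing the hypothetical _read_equality_deletes from the earlier comment (not a function in this PR):

from pyiceberg.utils.concurrent import ExecutorFactory

# equality_delete_tasks holds (data_file_path, delete_file) pairs, as built above.
executor = ExecutorFactory.get_or_create()
equality_delete_results = executor.map(
    lambda args: (args[0], _read_equality_deletes(io, args[1])),
    equality_delete_tasks,
)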
pyiceberg/io/pyarrow.py
Outdated
-    deletes_per_file: Dict[str, List[ChunkedArray]] = {}
-    unique_deletes = set(itertools.chain.from_iterable([task.delete_files for task in tasks]))
-    if len(unique_deletes) > 0:
+def _read_all_delete_files(io: FileIO, tasks: Iterable[FileScanTask]) -> Union[Dict[str, pa.ChunkedArray], pa.Table]:
Should this be:

-def _read_all_delete_files(io: FileIO, tasks: Iterable[FileScanTask]) -> Union[Dict[str, pa.ChunkedArray], pa.Table]:
+def _read_all_delete_files(io: FileIO, tasks: Iterable[FileScanTask]) -> Dict[str, List[Union[pa.ChunkedArray, pa.Table]]]:
pyiceberg/io/pyarrow.py
Outdated
@@ -1679,7 +1749,7 @@ def batches_for_task(task: FileScanTask) -> List[pa.RecordBatch]:
             break


     def _record_batches_from_scan_tasks_and_deletes(
-        self, tasks: Iterable[FileScanTask], deletes_per_file: Dict[str, List[ChunkedArray]]
+        self, tasks: Iterable[FileScanTask], deletes_per_file: Union[Dict[str, pa.ChunkedArray], pa.Table]
-        self, tasks: Iterable[FileScanTask], deletes_per_file: Union[Dict[str, pa.ChunkedArray], pa.Table]
+        self, tasks: Iterable[FileScanTask], deletes_per_file: Dict[str, List[Union[pa.ChunkedArray, pa.Table]]]
Hi @geruh - thanks for working on this PR, and sorry for the delayed review.
I've added some review feedback. Let me know your thoughts!
@sungwy Thanks a lot! I have made the suggested changes. Could you take another look?
Closes #1210
Summary
This work was primarily done by @rutb327 while I provided guidance!
This PR adds equality delete read support to PyIceberg by implementing the delete file indexing system that matches delete files to data files, mirroring the behavior of Iceberg Core. With this implementation, delete files are indexed and equality deletes can now be read during table scans.
Design details
Delete File Index
The new DeleteFileIndex class centralizes handling of all delete file types: positional deletes, equality deletes, and deletion vectors. It organizes deletes by type (equality vs. positional), partition (using PartitionMap for spec-aware grouping), and path (for path-specific positional deletes). This enables efficient lookup during table scans, reducing unnecessary delete file processing.
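As a rough illustration of that lookup structure (class and field names here are illustrative, not the PR's actual attributes, and sequence-number filtering is elided for brevity), the index might group deletes like this:

from collections import defaultdict
from typing import Dict, List, Tuple

from pyiceberg.manifest import DataFile


class DeleteIndexSketch:
    # Illustrative grouping of delete files for per-data-file lookup.
    def __init__(self) -> None:
        self.global_eq_deletes: List[DataFile] = []  # equality deletes from an unpartitioned spec
        self.eq_deletes_by_partition: Dict[Tuple, List[DataFile]] = defaultdict(list)
        self.pos_deletes_by_partition: Dict[Tuple, List[DataFile]] = defaultdict(list)
        self.pos_deletes_by_path: Dict[str, List[DataFile]] = defaultdict(list)

    def for_data_file(self, partition: Tuple, file_path: str) -> List[DataFile]:
        # Return only the delete files that could apply to this data file,
        # so a scan task never processes unrelated deletes.
        return (
            self.global_eq_deletes
            + self.eq_deletes_by_partition.get(partition, [])
            + self.pos_deletes_by_partition.get(partition, [])
            + self.pos_deletes_by_path.get(file_path, [])
        )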
Equality Delete support
Equality delete files are loaded as PyArrow Tables whose schemas carry their respective equality IDs, and files that share the same set of equality IDs are grouped into one table to reduce the number of anti-join operations.
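Applying one grouped delete table then boils down to a PyArrow left anti join. A toy sketch, assuming a single equality ID column named id:

import pyarrow as pa

data = pa.table({"id": [1, 2, 3, 4], "val": ["a", "b", "c", "d"]})
eq_deletes = pa.table({"id": [2, 4]})  # delete rows, projected to the equality ID columns

# A "left anti" join keeps only the data rows with no match in the delete table.
remaining = data.join(eq_deletes, keys=["id"], join_type="left anti")
print(remaining.sort_by("id").to_pydict())  # {'id': [1, 3], 'val': ['a', 'c']}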
Testing
Added tests ported from the Iceberg core DeleteFileIndex test suite, plus some tests with dummy files, as well as manual testing against a Flink setup.
Are there any user-facing changes?
Yes: tables with equality deletes can now be read.